Statistical mechanics of transcription-factor binding site discovery 

using Hidden Markov Models 

Pankaj MehtC 
Dept. of Physics, Boston University, Boston, MA 

o 

OO 
(N 



B 



X 



David J. Schwa ql 
.Depi. 0/ Molecular Biology and Lewis-Sigler Institute, 



^ I Princeton University, Princeton, NJ 

■ n 

S Anirvan M. Senguptacl 

BioMAPS and Dept. of Physics, Rutgers University, Piscatway, NJ 

Abstract 

Hidden Markov Models (HMMs) are a commonly used tool for inference of transcription factor 
(TF) binding sites from DNA sequence data. We exploit the mathematical equivalence between 

O' 

CJ . HMMs for TF binding and the "inverse" statistical mechanics of hard rods in a one-dimensional 

[ disordered potential to investigate learning in HMMs. We derive analytic expressions for the Fisher 

information, a commonly employed measure of confidence in learned parameters, in the biologically 
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mechanics to derive a scaling principle relating the specificity (binding energy) of a TF to the 
I minimum amount of training data necessary to learn it. 
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I. INTRODUCTION 



Biological organisms control the expression of genes using transcription factor (TF) pro- 
teins. TFs bind to regulatory DNA segments (6-20bp) called binding sites thereby controlling 
the expression of nearby genes. An important task in Bioinformatics is identifying TF bind- 
ing sites from DNA sequence data. This poses a non-trivial pattern recognition problem, 
and many computational and statistical techniques have been developed towards this goal. 
The goal of these algorithms is to identify new binding sites starting from a known collection 
of TF binding sites. Many different types of algorithms exist including Position Weight Ma- 
trices (PWMs) jl, 2|, biophysics- inspired alogrithms {2, 3], Hidden Markov Models (HMMs) 
4|t6|], and information theoretic algorithms 7]. 



In general, only a limited number of binding sites are known for a given TF. Thus, 
any algorithm must build a general classifier based on limited training data. This places 
constraints of the type of algorithms and classifiers that can be used. The end goal of all 
models is generalization-the ability to correctly categorize new sequences that differ from 
the training set. This is especially important since the training set is comprised of a small 
sample fraction of all possible sequences. Most algorithms create a (often probabilistic) 
model for whether a particular DNA sequence is a binding site. The model contains a set 
of parameters, 8, that are fit, or learned, from training data. 

All algorithms exploit the statistical differences between binding sites and background 
DNA in order to identify new binding sites. Two distinct factors contribute to how well one 
can learn 9, the size of the training data set and the specificity of the TF under consideration. 
Many TFs are highly specific. Namely, they bind strongly only to small subset of all possible 
DNA sequences which are statistically distinct from background DNA. Physically, this means 
that these TF have large binding energies for certain sequence motifs (binding sites) and 
low binding energies for random segments of DNA, i.e. "background" DNA. Other TFs are 
less specific and often exhibit non-specific binding to random DNA sequences. In this case, 
the statistical signatures that distinguish binding sites from background DNA are less clear. 
In general, the more training data one has and the more specific a TF, the easier it is to 
learn its binding sites. 

This raises the natural question: how much data is needed to train an algorithm to learn 
the binding sites of a TF? In this paper, we explore this question in the context of a widely- 



used class of bioinformatic methods termed Hidden Markov Models (HMMs). We exploit 
the mathematical equivalence between HMMs for TF binding and the "inverse" statistical 
mechanics of hard rods in a one- dimensional disordered potential to derive a scaling principle 
relating the specificity (binding energy) of a TF to the minimum amount of training data 
necessary to learn its binding sites. Unlike ordinary statistical mechanics where the goal is 
to derive statistical properties from a given Hamiltonian, the goal of the " inverse " problem 
is to learn the Hamiltonian that most likelygave rise to the observed data. Thus, we are led 
to consider a well-studied physics problem [8y the statistical mechanics of a one- dimensional 
gas of hard rods in an arbitrary external potential-from an entirely new perspective. 

The paper is organized as follows. We start by reviewing the mapping between HMMs 
and the statistical mechanics of hard rods. We then introduce the Fisher Information, a 
commonly employed measure of confidence in learned parameters, and derive an analytic 
expression for the Fisher information in the dilute binding site limit. We then use this 
expression to formulate a simple criteria for how much sample data is needed to learn the 
binding sites of a TF of a given specificity. 



II. HMMS FOR BINDING SITE DISCOVERY 



HMMs are powerful tools for analyzing sequential data |9|, [10[ that have been adapted to 
binding site discovery {5, 6]. HMMs model a system as a Markov process on internal states 
that are hidden and cannot be observed directly. Instead, the hidden states can only be 
inferred indirectly through an observable state-dependent output. In the context of binding 
site discovery, HMMs serve as generative models for DNA sequences. A DNA sequence is 
modeled as a mixture of hidden states-background DNA and binding sites-with a hidden 
state-dependent probability for observing a nucleotide (A, T, C, G) at a given location (see 
Figure [XT]) . 

For concreteness, consider a TF whose binding sites are of length I. An HMM for discov- 
ering the binding sites can be characterized by four distinct elements (see Fig. 1) jlO|: 

1. / + 1 hidden states with state corresponding to background DNA and states j — 1 . . . I 
corresponding to position j of a binding site. 

2. 4 observation symbols corresponding to the four observable nucleotides a = A,T, C, G. 

3. The transition probabilities, {a^} ( i, j — . . . I) between the hidden states which take 



the particular form shown in Figure HT1 with only fl^j+i, a;o, and aoi non-zero. In addition, 
for simplicity, we assume that binding sites cannot touch (i.e. an = 0.) The generalization 
to the case where the last assumption is relaxed is straightforward. 

4. The observation symbols probabilities {bj(a)} for seeing a symbol a = A, C, G, T in a 
hidden state j. Often we will rewrite these probabilities in more transparent notation with 
Pa = bo(a), the probability of seeing base a in the background DNA, and , pff 1 = bj(a) the 
probability of seeing base a at position j in a binding site. 

Finally, denote the collection of all parameters of an HMM (a^ and bi(a)) by the symbol 9. 

A DNA sequence of length L, S = S1S2 ■ ■ ■ sl, is generated by an HMM starting in a 
hidden state q% as follows. Starting with i — 1, choose Sj according to b qi (si) and then 
switch to a new hidden state using the switching probabilities a qi q i+i and repeat this 
process until i = L. In this way, one can associate a probability, p(S\9), to each sequence 
iS, corresponding to the probability of generating S using an HMM with parameters 9. The 
goal of bioinformatic approaches is to learn the parameters 9 from training data and use 
the result to predict new binding sites. Many specialized algorithms, often termed dynamic 
programming in the computer science literature, have been developed to this end 
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A. Mapping HMMs to the statistical mechanics of hard rods 

Before discussing the mapping between HMMs and statistical mechanics, we briefly re- 
view the physics of a one-dimensional gas of hard rods in a disordered external field js]. The 
system consists of hard rods-one-dimensional hard core particles-of length / in a spatially 
dependent binding energy E(S Xi ), with Xi the location of the starting site, at a inverse tem- 
perature (3, and a fugacity, z (i.e. chemical potential /1 = log z). The equilibrium statistical 
mechanics of the system is determined by the grand canonical partition function obtained 
w summing over all possible configurations of hard rods obeying the hard-core constraint 
8] . In addition, the pressure can be calculated by taking the logarithm of the grand canon- 
ical partition function. Since this model is one-dimensional and has only local interactions, 
many statistical properties can be calculated exactly using Transfer Matrix techniques. Con- 
sequently, variations of this simple hard rod model have been used extensively to model the 



sequence dependence of nucleosome positioning 



12|. 



We now discuss the mapping between HMMs and a gas of hard rods. We start by 




FIG. 1. Hidden Markov Model for binding sites of size I. (Top) There are I + 1 hidden states, with 
state background DNA and state j corresponding to position j = 1 . . . I in a binding site. The 
HMM is described by a Markov process with transition probabilities give by aij. (Middle) Each 
state j in an HMM is characterized by an observation symbol probability bj(k), for the probability 
of seeing symbol k = A, T, C, G in a state j (Bottom) A given sequence of DNA is composed of 
binding sites and background DNA. 

showing that the observation symbol probabilities bj(a) have a natural interpretation as 
a binding energy. Consider a DNA sequence S = Si . . . sj, with / the length of a binding 
site. Denote the corresponding hidden state at the j-th position of S by qj. It is helpful to 
represent this sequence by a I by 4 matrix Sj a of DNA of length I where Sj a = 1 if base 
Si = a and zero otherwise. Denote the probability of generating S from background DNA 
as P(S\{qj = 0, j = 1... I}, 9) = Y[ l j=ibo{ s j), an d the probability of generating the same 
sequence within a binding site is P(S\{qj = j, j = 1 . . . /}, 0) = Ylj bj(sj). Note that we can 
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rewrite the ratio of these probabilities as 

P(S\{q j =j,j = l...l},6) Tjbtis 



P(S\{q j = 0,j = l...l},6) Y&o( Sj 



where we have defined a "sequence-dependent" binding energy 

E(S) =e- S^Y.^ ( 2 ) 

aj 

with 

— GS)~* (?)■ 

Notice that the ratio Eq (JTJ) is of a Boltzmann form with a 'binding energy' that can be 
expressed in terms of a Position Weight Matrix (PWM), e, related to the observation symbol 
probabilities Eq. (j3J). 

Now consider a sequence S = Sis 2 ■ ■ . sl of length L ^> I. In this case, the probability 
of generating the sequence, P(S\9), is obtained by summing over all possible hidden state 
configurations. Notice that we can uniquely denote a hidden state configuration by specifying 
the starting positions within the sequence S of all the binding sites, {x\ . . .x n }. The hard- 
rod constraint means that the only allowed configuration are those where \x u — x v \ > / + 1 
for all u,v (the extra factor of 1 arises because aio = 0). Consequently, the probability of 
generating a sequence S is given by summing over all possible hidden state configurations 

P(S\9) = J2J2 ^l^ 1 • • • °) p ^ ■ ■ ■ x n}\0)- (4) 

n x\...x n 

where P ({x\ . . . x n }\6) is the probability of generating an allowed hidden state configuration, 
{xi, . . . , x n } and we have factorized the probability using the fact that in a HMM, transition 
probabilities are independent of the observed output symbol. Furthermore, the ratio of 
P({xi . . . x n }\6) to the probability of generating a hidden-state configuration with no binding 
sites, F(0|0) is just 

P({x l ...x n }\6) _ 

pW) [) 

with the 'fugacity', z, given by 

a i 



(l-a iy+ r 

and \i = logz the chemical potential. Combining Eqs. flTJ), (JSJ), and (jlj) yields 

P(S\6) 



(6) 



C(S,0) z ^ ^ 



with 

L/l 

Z(S\0) = J2J2 e-X*i..*n E &> z " (8) 

n=0 XI. ..x n 

and 

C(S,6) = afr 1 l[ifr (9) 

i,a 

where E(xi) is the binding energy, Eq. (pQ), for a sub-sequence of length I starting at position 
Xi of S. 

Notice that Z{S\6) is the grand canonical partition function for a classical fluid of hard 



rods in an external potential 
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The sequence-dependence PWM e acts as an arbitrary 
external potential, and the switching rate aoi sets the chemical potential for binding. Thus, 
up to a multiplicative factor C(S,6) that is independent of the emission probabilities for 
binding sites, an HMM is mathematically equivalent to a thermodynamic model of hard rods. 
Importantly, the amount of training data, L, plays the role of system size. Furthermore, 
the negative log-likelihood, — log P(S\9) is, up to a factor of L just the pressure of the gas 
of hard rods [8j. In what follows, we exploit the relationship between system size and the 
quantity of training data to use insights from finite-size scaling to better understand how 
much data one needs to learn small differences. The relationship between HMMs and the 
statistical mechanics of hard rods is summarized in Table [I] 



B. HMMs, Position- Weight Matrices, and cutoffs 

The matrix of parameters, 6j a , defined in equation ([3]), are often referred to in bioin- 
formatics as the Position Weight Matrix (PWM) PWMs are the most commonly 

used bioinformatic method for discovering new binding sites. In PWM-based approaches, 
sequences, S, whose binding energies, E(S) = e ■ S are below some arbitrary threshold, , 
are considered binding sites. This points to a major shortcoming of PWM based methods- 
namely the inability to learn a threshold directly from data. A major advantage of HMM 
models over PWM-only approached is that HMMs learn both a PWM, e, and a natural 
"cutoff" through the chemical potential \l = logz [6|. In terms of the corresponding hard- 
rod model, the probability, Pb s (S), for a sequence, S, to be a binding site takes the form of 
a Fermi-function, 

p US) = TT ^- (io) 
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TABLE I. Relationship between HMMs and the statistical mechanics of hard-rods. 





HMMs 


hard-rods 


L 


Size of training data 


System size 


S a 

3 


Nucleotide sequence 


Disorder 


hi (a) 

3 V / 


Symbol Probability 


Binding Energy 




Switching rates 


Fueacity 


p{s\e) 


Probability 


Partition Function 


log P(S\8) 

o \ w | v } 


Log-likelihood 


Pressure 


W)\i3 


Fisher Information 


Correlation Functions 




Dynamic Prog. 


Transfer Matrices 




Expectation Maximization 


Variational Methods 



If one makes the reasonable assumption that a sequence S is a binding site if Pb s {S) > 1/2, 
we see that /i serves as a natural cut-off for binding site energies 6|. Thus, the switching 
probabilities of the HMM can be interpreted as providing a natural cut-off for binding 
energies through ([6]). This points to a natural advantage of HMMs over PWM-only approach, 
namely one learns the threshold binding energy for determining whether a sequence is a 
binding site self-consistently from the data. Thus, though in practice binding sites are 
dilute in the DNA and hard-rod constraints can often be neglected, it is still beneficial to 
use the full HMM machinery for binding site discovery. 



III. FISHER INFORMATION & LEARNING WITH FINITE DATA 
A. Fisher Information and Error-bars 

In general, learning the parameters of an HMM from training data is a difficult task. 
Commonly, parameters of an HMM are chosen to maximize the likelihood of observed data, 
S, through Maximum Likelihood Estimation (MLE), i.e. parameters are chosen so that 

9 = argmax£(iS|#) = argmaxlogP(5|#). (11) 



S 




► 

INCREASING AMOUNTS OF DATA 

FIG. 2. The log-likelihood becomes peaked around true parameters with increasing data, analogous 
to finite-size scaling of pressure times volume (logarithm of the grand canonical partition function) 
in the corresponding statistical mechanical model. 



Finding the global maxima is an extremely difficult problem. However, one can often find 
a local maximum in parameter space, 9, using Expectation Maximization algorithms such as 
Baum- Welch . In general, for any finite amount of training data, the learned parameters 
9 (even if they are a global maxima) will differ from the "true" parameters 9t- The reason 
for this is that the probabilistic nature of HMMs leads to 'finite size' fluctuations so that 
the training data may not be representative of the data as a whole. These fluctuations are 
suppressed asymptotically as the training data size approaches infinity. For this reason, it 
is useful to have a measure of how well the learned parameters 9 describe the data. 

In the remainder of the paper, we assume there is enough training data to ensure that we 
can consider parameters in the neighborhood of the true parameters . The mapping between 
HMMs and the statistical mechanics of hard rods allows us to gain insight into the relation- 
ship between the amount of training data and the confidence in learned parameters. Recall 
that log-likelihood per unit volume, C(S\9)/L, is analogous to a pressure and the amount 
of training data is just the system size. From finite-size scaling in statistical mechanics, we 
know that as L — > oo the log-likelihood/pressure becomes increasingly peaked around its 
true value (see Fig. [2]). In addition, we can approximate the uncertainty we have about 
parameters by calculating the curvature of the log-likelihood, d AB C(S\9), around 9 where 
8a denotes the derivative with respect to the v4-th parameter. 

This intuition can be formalized for MLE using the Cramer- Rao bound which relates the 
covariance of estimated parameters to the Fisher Information (FI) Matrix, Zab(9), defined 
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by 

[1(6)} AB = -E [d 2 AB C(S\6)] (12) 

where Eg[g(S)) = J2sP(S\9)g(S) ({9} and see Appendix). An important property of the 
Fisher information is that it provides a bound for how well one can estimate the parameters 
of the likelihood function by placing a lower bound on the covariance of the estimated 
parameters. The Cramer- Rao bound relates the Fisher information to the expected value of 
an unbiased estimator, Eg[9(S)], and the covariance matrix of the estimator, 

[Cov e (9)] AB = Eo[(6 A (S) - 6 A )(6 B (S) - 9 B )], 

through the inequality 

[Cov e (§)} > [T{0)\-\ (13) 

For MLE, the Cramer-Rao bound is asymptotically saturated in the limit of infinite data. 
Thus, we expect the Fisher Information to be a good approximation for Cov^] when the 
amount of training data is large. In the limit of large data, the pressure, or equivalently 
C(S\9), "self-averages" and we can ignore the expectation value (1A4j) . Thus, to leading order 
in L, one can approximate the covariance matrix as 

[Covo(9))ab « IZ(9)]ab « " [d 2 AB C{S\9)Y l , (14) 

in agreement with intuition from finite size scaling. The previous expression provides a way 
to put error bars on learned parameters. However, in practice we seldom have access to 
the "true" parameters 9 that generated the observed sequences. Instead, we only know the 
parameters learned from the training data, 9. Thus, one often substitutes 9, our best guess 
for the parameters 9 in Eq. ffT4"j) . 

B. Fisher Information as correlation functions 

It is worth noting that the expression above, in conjunction with the mapping to the 
hard-rod model, allows us to calculate error bars directly from data. In particular, we show 
below that the Fisher information can be interpreted as a correlation function and thus 
can be calculated using Transfer Matrix techniques. It is helpful to reframe the discussion 
above in the language of the statistical mechanics of disordered systems. Recall that up to 
a normalization constant, C(S, 9), in the corresponding hard-rod model P(S\9) is the grand 
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canonical partition function, C(S\8)/L is the pressure, and the amount of training data, L, 
is just the size of the statistical mechanical system (see Table 1 of main text). When L is 
large, we expect that the sequence S self-averages and the Fisher Information is related to 
the second derivative of the log-likelihood of the observed data, 

[1(0)Ub « -d 2 AB C(S\6). (15) 

Thus, aside from the the normalization C(S,6), the Fisher Information can be calculated 
from the second derivative of the pressure. From the fluctuation dissipation theorem, we 
conclude that the Fisher information can be expressed in terms of connected correlation 
functions. In particular, let O a be the operator conjugate to the A-the parameter, 9 a in the 
partition function Z${9). The Fisher Information then takes the form 

W)}ab = (0 A B ) c -d AB logC{S,9). (16) 

where (O a O b ) c = (O a O b ) - (O a )(O b ) and 

(O) = ^ n= °^"" (17) 

Note that these correlation functions can be calculated directly from the data using Transfer 
Matrix techniques without resorting to more complicated methods. 

In general, the background statistics of the DNA are known and the parameters one 
wishes to learn are the switching rates, a^-, and symbol observation probabilities, bj(a). In 
practice, it is often more convenient to work with the fugacity, z, rather than the switching 
rate (see Table 1). The operator conjugate to the fugacity is n, the number of binding sites. 
Consequently, 

[Z(0)U = ((n-(n)) 2 ) + d zz logC(S,9). (18) 

Thus, the uncertainty in the switching rates is controlled by the fluctuations in binding 
site number, as is intuitively expected. One can also derive the conjugate operators for 
the emission probabilities bj(a) and/or the sequence dependent "binding energies" 6j a (see 
Table 1) via a straight forward calculation (see calculations in sections below). 

The expression f )16p provides a computationally tractable way to calculate the Fisher 
Information and, consequently, the covariance matrix [Cov e(9)] A B- Not only can we learn 
the maximum likelihood estimate for parameters, we can also put 'error bars' on the MLE. 
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We emphasize that in general, this requires powerful, computationally intensive techniques. 
However, by exploiting transfer matrix/ dynamic programming techniques, the correlation 
functions (I16p can be computed in polynomial time. This result highlights how thinking 
about HMMs in the language of statistical mechanics can lead to interesting new results. 

IV. ANALYTIC EXPRESSION USING A VIRIAL EXPANSION 

In general, calculating the log-likelihood C(S, 9) analytically is intractable. However, 
we can exploit the fact that binding sites are relatively rare in DNA and perform a Virial 
expansion in the density of binding sites, p, or in the HMM language, the switching rate 
from background to binding site (aoi <C 1). This is a good approximation in most cases. 
For example, for the NF-kB TF family, a m was recently found to be of order 10~ 2 — 10~ 4 js] 
Thus, to leading order in p, we can ignore exclusion effects due to overlap between binding 
sites and write the partition function of the hard-rod model as 

L 

z S {o) yi } i z~ E(xi)zn « n^ 1 + ze ~ E{sn ) + <p 2 ) ( i9 ) 

n=0 x\...x n <t=1 

where E(S U ) is the binding energy, (j2J), for a hard-rod bound to a sequence, S a , of length I 
starting at position a on the full DNA sequence S. The corrections due to steric exclusion 
are higher order in density and thus can be ignored to leading order. Thus, the log-likelihood 
takes the simple form 

C{S\B) « lo S ( X + ze ~ E{Sn ) - lo S C ( S , °)- (2°) 

17 

where C(S,9) is a normalization constant. 

Notice the log-likelihood (1201) is a sum over the free-energies of single particles in potentials 
given by the observed DNA. For long sequences where L> 1, we expect on average iV = La^i 
binding sites, and L — N background DNA sequences in the sum. In this case, we expect 
that the single particle energy self-averages and we can replace the sum by the average value 
of the single-particle free energy in either background DNA or a binding site. In particular, 
we expect that 

C(S\6) « A(log (1 + ze~ E ^)) bs + (L-N) (log (1 + ze~ E ^)) bg - log C(S, 6) (21) 

where {H(S))b g and (H(S))b s are the expectation value of H(S) for sequences S of length / 
drawn from the background DNA and binding site distributions, respectively. 
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A. Maximum Likelihood Equations via the Virial expansion 



We now derive the Maximum-Likelihood equations (MLE) within the Virial expansion 
to the log-likelihood ( 120]) . Recall from (jTTj) that the Maximum Likelihood estimator is the 
set of parameters most likely to generate the data. Thus, we can derive MLE by taking the 
first derivatives of the log-likelihood and setting the expressions to zero. Consider first the 
MLE for the binding energy matrix e^. Since C(S, 8) is independent of the binding energy, 
we focus only on the first term of (I20p . Define the matrix Si a which is one if position i has 
base a and zero otherwise. The MLE can be derived by taking the first derivative 

cr i a. 

= ^Ul>g (1 + ze~ E ^) + MEpS ~ !)1 ( 22 ) 

a id 

where \ are Langrange multipliers that ensure proper normalization of probabilities.. Ex- 
plicitly taking the derivative, using probability conservation, and noticing that J2 a ^ia = 1 
gives 

= P«e-^=pl s) (23) 



EJUS") 

where 

U(S°) = j - (24) 

is the Fermi-Dirac distribution function. 

We can also derive the MLE corresponding to the fugacity. The fugacity depends ex- 
plicitly on the normalization constant C{S,6). Note that in HMMs, C(S,8) = a^ YlaPa 101 
and ensures probability conservation. Since z = a 01 /(l — a 01 ) l+1 , to leading order in a 01 , 
naively log C(S, 8) ~ Llog(l — z). However, choosing this normalization explicitly violates 
probability conservation in the corresponding HMM because we have truncated the Virial 
expansion for the log-likelihood at first order and consequently allowed unphysical configu- 
rations. Since deriving the MLEs requires probability conservation, we impose by hand that 
the normalization has the z dependence, 

log C(<S,0) ~Llog(l + 2). (25) 

With this normalization, the log-likelihood (|20|) becomes analogous to that for a mixture 
model where the sequences S a are drawn from background DNA or binding sites. With 
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this choice of C(S, 9) the MLE equations can be calculated in a straight forward manner by 
taking the derivative of (120]) with respect to z (see Appendix B) to get 

Lz 



£/«.e(S*) = r 



+ z 



(26) 



B. Fisher Information via the Virial Expansion 

One can also derive an analytic expressions for the Fisher information within the Virial 
expansion. Generally, the background observation probabilities bo(a) = p a are known and 
the HMM parameters, 9, to be learned are the observation symbol probabilities in binding 
sites, bj(a) = and the switching probability a i- Technically, it is easier to work with the 
corresponding parameters of the hard-rod model, the binding energies ej a and the fugacity, 
z. Note that probability conservation and ([3D imply that only three of the ei a (a = A, C, G) 
are independent. A straight forward calculation (see Appendix) yields 



W)~ W ~ N(A iaJ p) bs + (L — N)(A iaJf3 ) 



(27) 



with 



and for i ^ j, 



A ia ,il3 - [fz,e{S)Y 



(bs) (bs) 

X C C I ^ >i ^ c c 
a /3Oi a Oi/3 i Tj^r OiTOiT 

(Pi T ) 2 



Aa,(j^i)(5 = -f z ,e(S)(l - fz,e(S)) 



S. 



(bs) 
Via 



(bs) 



PiT 



(bs) 

c V iP c 



PjT 



where, as above, f z , € (S) is the Fermi-Dirac distribution function 

1 



1 + z -l e E(S)- 



One also has (see Appendix B) 



(28) 



(29) 



[1(9) 



-ii 



N(C ia ) bs + {L-N){C ia ) bg , 



(30) 



with 



and 



Cia — ~ fz,e(S)(l — fz,t(S))Si a , 



[A0Y\z ~ N(D) bs + (L — N)(D) bg , 



(31) 



(32) 
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with 

The expressions ( 127|) , (I30I) , and ( 133|) depend only on e iQ and thus can be used to calculate 
the expected error in learned parameters as a function of training data using only the Position 
Weight Matrix (PWM) of a transcription factor and a rough estimate of the switching 
probability aoi or equivalently the fugacity z. The explicit dependence on base T reflects 
the fact that not all the elements of the PWM are independent. 

V. SCALING RELATION FOR LEARNING WITH FINITE DATA 

An important issue in statistical learning is how much data is needed to learn the parame- 
ters of a statistical model. The more statistically similar the binding sites are to background 
DNA (i.e the smaller the binding energy of a TF), the more data is required to learn the 
model parameters. The underlying reason for this is that the probabilistic nature of HMMs 
means that the training data may not be representative of the data as a whole. Intuitively, 
it is clear that in order to be able to effectively learn model parameters, the training data 
set should be large enough to ensure that "finite-size" fluctuations resulting from limited 
data cannot mask the statistical differences between binding sites and background DNA. To 
address this question, we must consider PWMs learned from strictly random data. As the 
size of the training set is increased, the finite-size fluctuations are tamed. Our approach is 
then, in a sense, complementary to looking for rare, high-scoring sequence alignments which 



become more likely as L increases in random data 14|. Of course, estimations based on 



random data neglect non-trivial structure of real sequences [151 ]. 



A. Maximum Likelihood and Jeffreys priors 

Within the Maximum Likelihood framework, the probability that one learn a ML estima- 
tor, 9, given that the data is generated by parameters 9, can be approximated by a Gaussian 
whose width is related to the Fisher information using a Jeffreys prior jl6| . 



P{9) oc v^Z^e-^^^-H (34) 

As expected, the width of the Gaussian is set by the covariance matrix for 6, and is re- 
lated to the second derivative of the log-likelihood through f lT4|) . Since the log-likelihood-in 
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analogy with the pressure (times volume) of the corresponding hard-rod gas-is an extensive 
quantity, an increase in the amount of training data L means a narrower distribution for the 
learned parameters 9 (see Fig. [2]). When L is large, the inverse of the Fisher information 
is well approximated by the Jacobian of the log-likelihood, (|14p . In general, the Jacobian 
is a positive semi-definite, symmetric square matrix of dimension n, with n the number of 
parameters needed to specify the position weight-matrix and fugacity for a single TF. In 
most cases, n is large and typically ranges from 24 — 45, with the exact number equal to 
three time the length of a binding site. 

Label the v4-th component of 9 by 9 a- Then, the probability distribution f[3~4"j) can also 



rzl 



be used to derive a distribution for the Mahalanobis distance 

^ 2 = - £& - M^^kfe - M- (35) 

The Mahalanobis distance is a scale-invariant measure of how far the learned parameters 
9 are from the true parameters 9. Intuitively, it measures distances in units of standard 
deviations. Furthermore, the Mahalanobis distance scales linearly with the amount of 
data/system size L since it is proportional to log-likelihood C(S\9 ) (i.e. pressure times 
volume). By changing variable to the eigenvectors of the Jacobian, normalizing by the 
eigenvalues, and intergrating out angular variables, one can show that f )34|) yields the fol- 
lowing distribution for f, 

P(r) oc r n ~ l e- f2 = e -r 2 +(n-i)io g r _ ^ 

When n is large, we can perform a saddle-point approximation for r around its maximum 
value, 

K = y/(n-l)/2. (37) 



Writing f = r* + 5f, one has 



P(Sf) « e -(™-D/2+log(n-l)/2 e -^ (3g) 



Thus, for large n, almost in all cases the learned parameters 9 will be peaked sharply around 
a distance, = (n — l)/2, with a width of order 1. This result is a general property of 
large- dimensional Gaussians and will be used below. 
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B. Scaling Relation for learning with finite data 



We now formulate a simple criteria for when there is enough data to learn the binding 
sites of a TF characterized by a PWM e. We take as our null hypothesis that the data was 
generated entirely from background DNA (i.e the true parameters are eo = and Zq = z) 
and require enough data so that the probability of learning e = e be negligible. In other 
words, we want to make sure that there is enough data so that there is almost no chance 
of learning e = e for data generated entirely from background DNA, eo = 0. From (138]) . we 
know that for large n, with probability almost 1 due to finite size fluctuations, any learned 
e will lie a Mahalanobis distance, r 2 (e,z) = (n—l)/2 away from the true parameters. Thus, 
we require enough data so that 

r 2 (e,z) = Lf 2 (e,z) > (n - l)/2, (39) 

with f 2 (e, z) defined by the first equality. An explicit calculation of the left hand side of ( 1391) 
yields (see Appendix) 

(40) 

where we have defined 

E p*&. ^ = 1 ' 2 - ( 41 ) 

a=A,C,G 

Together, (139 p and ( l4"Uj) define a criteria for how much data is needed to learn the binding 
sites of a TF with PWM (binding energy), e, whose binding sites occur in background DNA 
with a fugacity z. Notice that ( |4T)]) contains terms that scale as the square of the energy 
difference, indicating that it is much easier to learn binding sites with a few large differences 
than many small differences. 

VI. DISCUSSION 

In this paper, we exploited the mathematical equivalence between HMM models for TF 
binding and the "inverse" statistical mechanics of hard rods in a one-dimensional disordered 
potential to investigate learning in HMMs. This allowed us to derive a scaling principle 
relating the specificity (binding energy) of a TF to the minimum amount of training data 
necessary to learn its binding sites. Thus, we were led to consider a well-studies physics 
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r 2 (z,e) 



l + z) 



E 



— r 2 

Pt 



problem [8j— the statistical mechanics of a one- dimensional gas of hard rods in an arbitrary 
external potential-from an entirely new perspective. 

In this paper, we assumed that there was enough data so that we could focus on the 
neighborhood of a single maximum in the Maximum Likelihood problem. However, in 
principle, for very small amounts of data, the parameter landscape has the potential to be 
glassy and possess many local minima of about equal likelihood. However in our experience, 
this does not seem to be the case in practice for most TFs. In the future, it will be interesting 
to investigate the parameter landscape of HMMs in greater detail to understand when they 
exhibit glassy behavior. 

The work presented here is part of a larger series of works that seeks to use methods from 
"inverse statistical mechanics" to study biological phenomenon 18l42l|. Inverse statistical 



mechanics inverts the usual logic of statistical mechanics where one starts with a microscopic 
Hamiltonian and calculates statistical properties such as correlation functions. In the inverse 
problem, the goal is to start from observed correlations and find the Hamiltonian from which 
they were most likely generated. In the context of binding site discovery, considering the 
inverse statistical problem allows us to ask and answer new and interesting questions about 
how much data one needs to learn the binding sites of a TF. In particular, it allows us to 
calculate error bars for learned parameters directly from data and derive a simple scaling 
relation between the amount of training data and the specificity of TF encoded in its PWM. 

Our understanding of how the size of training data affects our ability to learn the pa- 
rameters in inverse statistical mechanics is still in its infancy. It will be interesting to see 
if the analogy between finite-size scaling in the thermodynamics of disordered systems and 
learning in inverse statistical mechanics holds in other systems, or if it is particular to the 
problem considered here. More generally, it will be interesting to see in methods from physics 
and statistical mechanics yield new insights about large data sets now being generated in 
biology. 
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Appendix A: Covariance Matrix and Fisher Information 

The Fisher information is a commonly employed measure of how well one learns the 
parameters, 9, of a probabilistic model from training data, S. In our context, S is the 
observed DNA sequence and 9 are the parameters of the HMM for generating DNA se- 
quences. The Fisher information matrix, X AB [9\ is given in terms of the log-likelihood, 
C{S\9)=\og P {S\9),hy 

[X{9)]- A l B = E e [d A £(S\9)d B £(S\9)} 

= Y,P^\9)d A C(S\9)d B C(S\9). (Al) 
s 

where Eg denotes the expectation value averaged over different data sets generated using 
the parameters 9 and d A denotes the partial derivative with respect to the A-th component 
of 9 . 

The Fisher information can also be expressed as a second derivative of the log-likelihood 
function. This follows from differentiating both sides of the equation 

^e £ ^ = l (A2) 
s 

with respect to 9 a and 9b which yields the expression 

e c(m d A C(S\9)d B C(S\9) + ^ e c{m d 2 AB C(S\9) = 0. (A3) 
s s 

Comparing with ( lAlj) . we see that the Fisher information can also be expressed as 

W)\ab = -Ee [d\ B C{S\9)] = - Y,P{S\9)d\ B C{S\9). (A4) 

s 

An important property of the Fisher information is that it provides a bound for how 
well one can estimate the parameters of the likelihood function. As discussed in the main 
text, the parameters of a HMM can be estimated from an observed sequence, S, using a 
Maximum Likelihood Estimator (MLE), 9(S), defined as 

0(S) ee argmax£(S|#). (A5) 

9 
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The Cramer- Rao bound relates the Fisher Information to the expected value of the estimator 

E e [6(S)] = J2HS)p(S\e)i (A6) 
s 

and the covariance matrix of the estimator, 

[Cov e (9)] AB = E e [(9 A (S) - 9 A )(9 B (S) - e B )\. 
For a multidimensional estimator, the Cramer-Rao bound is given by 

[Gov e (e)] AB > J2dBE e (6 c )ll(6)] CD d B E e (9 D ). (A7) 

C,D 

To gain intuition, it is worth considering the special case where the estimator is unbiased, 
E S [9(S)] = 9, in which case the Cramer- Rao bound simply reads 

[Cov e (9)] AB > [X(#rV- (A8) 

Thus, the Fisher information gives a fundamental bound on how well one can learn the 
parameters of our HMM. 

For MLEs, the Cramer-Rao bound is asymptotically saturated in the limit of infinite 
data. Thus, we expect the Fisher Information to be a good approximation for Cov<j[#] when 
the length, L, of the DNA sequences, S, from which we learn parameters is long. In this 
case, 

[Cov e (9)] AB » [Z(tf)]- 1 (A9) 

The previous expressions provide a way to put error bars on learned parameters. However, 
in practice we never have access to the "true" parameters 9 that generated the observed 
sequences. Instead, we only know the parameters learned from the training data, 9. Thus, 
we make the additional approximation 

[Cov e (9)] AB » [Z(tf)]- 1 » [T{9)]-\ (A10) 

Appendix B: Calculation of Fisher Information using a Virial Expansion 
1. PWM dependent Elements 

We now calculate the Fisher information for a HMM for binding sites from a single binding 
site distribution using the Virial expansion. We are interested in the Fisher information for 
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the parameters ej Q (the energies in the corresponding Position Weight Matrix). An important 
complication is that not all the are independent. In particular, we have 



(Bl) 



Thus, there are only three independent parameters at each position in the binding site. Let 
us choose €iT to depend on the other three energies. Rearranging the equation above, one 
has that 

1 — J2ctytTPa e 



<-,T 



e iT = - log I : I = g 

\ Pt J 

Taking the first derivative of (l20i) with respect to ej a with a ^ T yields 

dC{S\6) 



(B2) 



OCT i ^9 ejT OCT 

oe in 



(B3) 



Taking the second derivative yields 



OCT , ^9tjT OCT 



OCT , ^9tjT OCT 



d 2 a 

-0ijO iT J Zt e{O 



deiadeip 
(B4) 



We can simplify the expressions further by noting 



dg eiT 


(bs) 
Pia 


de ia 


PiT 


d 2 9e„ 
dei a eij3 


(bs) 

X Pia 
° a P ( bs ) 
PiT 



+ 



Pia Pi/3 



(B5) 
(B6) 



Plugging in these expressions into (1B4|) yields 



d 2 C{S\d) 

deiadej/3 



j2fzAsna-f z As a )) 

(7 

,S bs ) 

$ijS iT f z ,e{S ) 



S: 



(bs) 
Pia Oct 
(bs) D iT 



PiT 



(bs) 

OCT PjP OCT 

(bs) ^jT 

P)t . 



Jbs) (bs) 
, Ptol | Pi/3 



(B7) 



The Fisher information is obtained in the usual way from 

d 2 £(S\6) 



-l 



(B8) 
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a. Simplified Equations for i = j 



When % = j, we can simplify the equations above using the ML equations ( 12 3 p and noting 
that SiaSip = SapSia and S ia SiT = 0. Using the expressions above yields 

(6s) (6s) (6s)" 



™> =EW5 *)( W ^)> 



From the MLE ([23]), we know that 



(6s) (6s) 

C X I Pil 3 Oct 
OiaOaP H — <-X 



(6s) v 2 



0< *£ (6s) + / (6sK 2 
(B9) 



\ f ( qcr\ Oct rig 



E-M^')^ 



/ 



'eBS 



E^o^ 



Plugging this into the equations above yields 



(6s) (6s) 

CT OCT , "»/8 C cr 



de ia deip 

This is the operator An in the main text 



(P.. 



OCT qct 
(6s) v 2 D iT^T 



j^Msn (bii) 



6. Simplified Equations for i ^ j 

In this case, we know that 

d 2 L(S\6) 
Ae ia Ae j/3 



(6s) 



(6s) 

QCT -^30 QCT 

D i/3 (6s) D JT 



E Aa,j/3(5" T ) 

(B12) 



This is the operator Am in the main text. 



2. Fugacity dependent Elements 



We start by calculating the elements Ti az . As before, within the virial expansion 



£(<S|0)«E 1O S( 1 + 



ze 



Llogfl + A 



(B13) 



Thus, we have 



(1 + *)' 



(B14) 
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Taking the second derivate with respect to ti a yields 



de ia z ^ (1 + ze tb ) 2 ^ z \ Pit J 



Furthermore, one has 



8j . E^P + I^. (B16) 



3. Relating expressions to those in main text 

The equations in the main text follow by noting that the sum over a can be replaced by 
a sum over expectation value over sequences S a drawn from the binding site distribution 
and background DNA. For an arbitrary function, H(S), of a sequence S of length I, 



(J a&BS aGBG 

« N(H(S)) bs + (L — N)(H(S)) bg , (B18) 



with N the expected number of binding sites in a sequence of length L, and where (H(S))b g 
and (H(S))b S are the expectation value of H(S) for sequences S of length I drawn from the 
background DNA and binding site distributions, respectively. Combining (1B18j) and 
with the expressions above yields the equations in the main text. 



Appendix C: Derivation of the scaling relationship 

To derive the scaling relationship, we must calculate the quantity 



a/ \ d 2 C(S\6) ^ d 2 C(S\6) d 2 C a 

ija/3 JH ia 
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where all the second derivatives are evaluated at e = 0. Plugging (1B7I) . (1B15I) . and (1B16j) 
into the expression above, one has 



Ccr 



Pt 



— T 2 ' 
e 2 + — 
Pt. 



E CO- _ e ^iT 



+ V /,,e= (^)(l - /,, e=0 (^))- - ^ 



(C2) 



where we have defined 



E 



7 

Pia^-iai 



(C3) 

with 7 = 1,2 and used the fact that piT = Pt when e = 0. When L is large, we can replace 
the sum over a by an expectation value in background DNA, 



Furthermore, 



(Sir) = PiT 

(SiaSjp) = PiaPj/3(l — 6ij) + Pia&ijbap 
(SiaSjr) = PiaPjT^ - 5ij) 

(S iT S jT ) = p iT p jT {l - 5ij) +p iT 5 ij . 



(C4) 



(C5) 



Plugging these expressions into (1C3|) . noting that the third term averages to zero, and 
simplifying yields 



r 2 (z,e) 



Lz 2 



^ Pt 



(C6) 



(1 + ^) 2 

Finally, it is often helpful to define a rescaled version of r 2 (z, e) that makes the dependence 
of L explicit, 

r 2 (z,e)^ r -^ (C7) 
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